WebSelF: A Web Scraping Framework

Authors

  • Jakob G. Thomsen
  • Erik Ernst
  • Claus Brabrand
  • Michael I. Schwartzbach
Abstract

We present WebSelF, a framework for web scraping that models the scraping process and decomposes it into four conceptually independent, reusable, and composable constituents. We have validated our framework through a full parameterized implementation that is flexible enough to capture previous work on web scraping. We conducted an experiment that evaluated several qualitatively different web scraping constituents (including previous work and combinations thereof) on about 11,000 HTML pages from daily versions of 17 web sites over a period of more than one year. Our framework solves three concrete problems with current web scraping, and our experimental results indicate that composing previous techniques with our new ones achieves a higher degree of accuracy, precision and specificity than existing tech-

Similar resources

A Semantic Scraping Model for Web Resources - Applying Linked Data to Web Page Screen Scraping

In spite of the increasing presence of Semantic Web facilities, only a limited number of the resources available on the Internet provide semantic access. Recent initiatives such as the emerging Linked Data Web are providing semantic access to available data by porting existing resources to the Semantic Web using different technologies, such as database-semantic mapping and scraping. Neverthel...

Penerapan teknik web scraping pada mesin pencari artikel ilmiah

Search engines are a combination of hardware and computer software supplied by a particular company through a designated website. Search engines collect information from the web through bots or web crawlers that crawl the web periodically. The process of retrieving information from existing websites is called "web scraping." Web scraping is a technique of extracting informat...

SMORE – Semantic Markup, Ontology, and RDF Editor

The promise of the Semantic Web is founded on the principle that online content will be semantically annotated, creating machine-understandable content using interlinking ontologies. In keeping with this principle, we introduce SMORE, the Semantic Markup, Ontology, and RDF Editor. It provides users with an integrated environment for creating web pages, email, and other online content while faci...

Data Driven Game Theoretic Cyber Threat Mitigation

Researchers at Arizona State University have created a data-driven game-based framework that models cyber attacker behavior. The key innovation of this invention is the combination of darkweb scraping of hacker exploit markets with game theory. This novel approach provides security analysts with a better understanding of the threat posed by zero-day exploits on the darkweb, and recommends decis...

Using Surveys and Web-Scraping to Select Tools for Software Testing Consultancy

We analyzed findings from data collected using surveys and web scraping to support Knowit Oy, a software testing consultation company, in the process of selecting the right tools for software testing and test automation. We conducted two surveys (2013 and 2016) among (mostly Finnish) software professionals to acquire criteria and a list of tools used for software testing in industry. Considerin...

Publication date: 2012